Hierarchical Indexing

Up to this point we’ve been focused primarily on one-dimensional and two-dimensional data, stored in Pandas Series and DataFrame objects, respectively. Often it is useful to go beyond this and store higher-dimensional data, that is, data indexed by more than one or two keys. Older versions of Pandas provided Panel and Panel4D objects that natively handled three-dimensional and four-dimensional data, but both have since been removed from the library. The far more common pattern in practice is hierarchical indexing (also known as multi-indexing), which incorporates multiple index levels within a single index.

In [1]:
# Import libraries

import pandas as pd
import numpy as np

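Let’s create a Series of eight random values drawn from the standard normal distribution. Passing a list of two lists as the index gives the Series a two-level index.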

In [2]:
data = pd.Series(np.random.randn(8), index=[["a", "a", "a", "b", "b", "b", "c", "c"], [1, 2, 3, 1, 2, 3, 1, 2]])
data
Out[2]:
a  1   -0.799629
   2    1.449937
   3    1.772006
b  1    0.703102
   2    0.631890
   3   -1.971740
c  1   -1.493603
   2    0.419737
dtype: float64

What is MultiIndex?

A MultiIndex lets a single index carry more than one level of labels, so each entry is identified by a combination of keys rather than a single key. To understand MultiIndex, let’s look at the index of the data.

In [3]:
data.index
Out[3]:
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 2),
            ('b', 3),
            ('c', 1),
            ('c', 2)],
           )
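As a small sketch of what a MultiIndex contains, the snippet below builds the same index explicitly and inspects its levels. It assumes fixed values instead of the random draws above so that it is reproducible, and the level names `letter` and `number` are made up for illustration.

```python
import pandas as pd

# Build the two-level index explicitly from two parallel arrays.
# The level names are hypothetical; the original Series has unnamed levels.
idx = pd.MultiIndex.from_arrays(
    [["a", "a", "a", "b", "b", "b", "c", "c"], [1, 2, 3, 1, 2, 3, 1, 2]],
    names=["letter", "number"],
)

# Fixed values stand in for the random numbers used in the text.
s = pd.Series(range(8), index=idx)

print(s.index.nlevels)           # how many levels the index has
print(list(s.index.names))       # the names assigned above
print(list(s.index.levels[0]))   # distinct labels of the outer level
print(list(s.index.levels[1]))   # distinct labels of the inner level
```

Each entry of the index is a tuple such as `("a", 1)`, which matches the tuples printed by `data.index`.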

MultiIndex is the Pandas index type that holds several levels of labels on a single axis; it works for both Series and DataFrame objects. Our data has two levels: the letters a, b, c on the outer level and the integers 1, 2, 3 on the inner level. You can select subsets of the data by outer label. For example, let’s take a look at the values with outer label a.

In [4]:
data["a"]
Out[4]:
1   -0.799629
2    1.449937
3    1.772006
dtype: float64
In [9]:
# Slicing by label also works on a MultiIndex; both endpoints are included
data["b":"c"]
Out[9]:
b  1    0.703102
   2    0.631890
   3   -1.971740
c  1   -1.493603
   2    0.419737
dtype: float64
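Label slicing on a MultiIndex generally expects the index to be sorted. The sketch below, which uses a small made-up Series rather than the data above, sorts an unordered index with `sort_index` before slicing.

```python
import pandas as pd

# A deliberately unordered two-level index (illustrative values only).
unsorted = pd.Series(
    [1, 2, 3, 4],
    index=[["c", "a", "b", "a"], [1, 2, 1, 1]],
)

# sort_index orders the entries by outer label, then by inner label.
ordered = unsorted.sort_index()

# Once sorted, a label slice on the outer level works and includes "b".
subset = ordered.loc["a":"b"]
print(subset)
```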
In [7]:
# We can also select more than one outer label at once
data.loc[["a","c"]]
Out[7]:
a  1   -0.799629
   2    1.449937
   3    1.772006
c  1   -1.493603
   2    0.419737
dtype: float64

You can also select on the inner level. Let’s take the values whose inner label is 1, across all outer labels.

In [8]:
data.loc[:,1]
Out[8]:
a   -0.799629
b    0.703102
c   -1.493603
dtype: float64
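Two other ways to select from a two-level Series are sketched here: a full `(outer, inner)` tuple for a single value, and the `xs` method for a cross-section on a chosen level. The values are fixed numbers chosen for illustration, not the random data above.

```python
import pandas as pd

s = pd.Series(
    [10, 20, 30, 40, 50, 60, 70, 80],
    index=[["a", "a", "a", "b", "b", "b", "c", "c"], [1, 2, 3, 1, 2, 3, 1, 2]],
)

# A full (outer, inner) tuple identifies exactly one entry.
single = s.loc[("b", 2)]

# xs takes a cross-section; level=1 refers to the inner level.
inner_two = s.xs(2, level=1)

print(single)
print(inner_two)
```

Here `s.xs(2, level=1)` picks the same entries as `s.loc[:, 2]`, but names the level explicitly.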

What is unstack?

The stack method moves column labels into the innermost level of the row index, and the unstack method does the reverse, moving an index level (the innermost one by default) into the columns. With unstack you can view the data as a table.

In [10]:
data.unstack()
Out[10]:
          1         2         3
a -0.799629  1.449937  1.772006
b  0.703102  0.631890 -1.971740
c -1.493603  0.419737       NaN

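The NaN in row c appears because the Series has no (c, 3) entry. To restore the original Series, you can use the stack method; in the output below, the missing entry is dropped again.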

In [11]:
data.unstack().stack()
Out[11]:
a  1   -0.799629
   2    1.449937
   3    1.772006
b  1    0.703102
   2    0.631890
   3   -1.971740
c  1   -1.493603
   2    0.419737
dtype: float64
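The `level` argument of `unstack` chooses which index level becomes the columns. The following sketch uses fixed values in place of the random data to compare the default with `level=0`.

```python
import pandas as pd

s = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    index=[["a", "a", "a", "b", "b", "b", "c", "c"], [1, 2, 3, 1, 2, 3, 1, 2]],
)

# Default: the inner level (1, 2, 3) becomes the columns.
wide_inner = s.unstack()

# level=0: the outer level (a, b, c) becomes the columns instead.
wide_outer = s.unstack(level=0)

print(wide_inner)
print(wide_outer)
```

In both tables the cell for the missing (c, 3) combination is filled with NaN.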

Hierarchical Indexing in a DataFrame

You can move one or more of a DataFrame’s columns into the row index. To show this, let’s create a dataset.

In [12]:
data = pd.DataFrame({"x": range(8), "y": range(8, 0, -1), "a": ["one", "one", "one", "one", "two", "two", "two", "two"], "b": [0, 1, 2, 3, 0, 1, 2, 3]})
data
Out[12]:
   x  y    a  b
0  0  8  one  0
1  1  7  one  1
2  2  6  one  2
3  3  5  one  3
4  4  4  two  0
5  5  3  two  1
6  6  2  two  2
7  7  1  two  3

Let’s turn columns a and b of this dataset into a two-level row index.

In [13]:
data2 = data.set_index(["a", "b"])
data2
Out[13]:
       x  y
a   b
one 0  0  8
    1  1  7
    2  2  6
    3  3  5
two 0  4  4
    1  5  3
    2  6  2
    3  7  1
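Once the columns are in the index, rows can be selected with `loc` in the same way as for a Series. This is a minimal sketch that rebuilds the same table under the name `df` (a name chosen here, not one used in the text).

```python
import pandas as pd

df = pd.DataFrame({
    "x": range(8),
    "y": range(8, 0, -1),
    "a": ["one"] * 4 + ["two"] * 4,
    "b": [0, 1, 2, 3] * 2,
}).set_index(["a", "b"])

# All rows under the outer label "one"; the result is indexed by b alone.
block = df.loc["one"]

# A full (a, b) tuple selects one row, returned as a Series.
row = df.loc[("two", 1)]

# Adding a column label selects a single cell.
cell = df.loc[("two", 1), "y"]

print(block)
print(row)
print(cell)
```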

By default, set_index removes the columns that it moves into the index. Pass drop=False to keep them as regular columns as well.

In [14]:
data3 = data.set_index(["a", "b"], drop=False)
data3
Out[14]:
       x  y    a  b
a   b
one 0  0  8  one  0
    1  1  7  one  1
    2  2  6  one  2
    3  3  5  one  3
two 0  4  4  two  0
    1  5  3  two  1
    2  6  2  two  2
    3  7  1  two  3
In [15]:
data2
Out[15]:
       x  y
a   b
one 0  0  8
    1  1  7
    2  2  6
    3  3  5
two 0  4  4
    1  5  3
    2  6  2
    3  7  1

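You can use the reset_index method to move the index levels back into ordinary columns. This restores the data, although a and b now come first in the column order.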

In [16]:
data2.reset_index()
Out[16]:
     a  b  x  y
0  one  0  0  8
1  one  1  1  7
2  one  2  2  6
3  one  3  3  5
4  two  0  4  4
5  two  1  5  3
6  two  2  6  2
7  two  3  7  1
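reset_index can also move back only some of the levels through its `level` argument. The sketch below rebuilds the indexed table under the name `df2` (a name chosen for this example) and resets just the inner level b.

```python
import pandas as pd

df2 = pd.DataFrame({
    "x": range(8),
    "y": range(8, 0, -1),
    "a": ["one"] * 4 + ["two"] * 4,
    "b": [0, 1, 2, 3] * 2,
}).set_index(["a", "b"])

# Move only level "b" back into the columns; "a" stays as the row index.
partial = df2.reset_index(level="b")

# With no arguments, every level becomes a column again.
full = df2.reset_index()

print(partial)
print(full)
```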